Fast Class-Attribute Interdependence Maximization (CAIM) Discretization Algorithm
Abstract
Discretization is the process of converting a continuous attribute into an attribute that contains a small number of distinct values. One of the major reasons for discretizing an attribute is that some machine learning algorithms perform poorly with continuous attributes and thus require front-end discretization of the input data. The paper describes the Fast Class-Attribute Interdependence Maximization (F-CAIM) algorithm, an extension of the original CAIM algorithm. The algorithm works with supervised data by maximizing the class-attribute interdependence. F-CAIM’s improvement over CAIM is a significant reduction in the computational time required to discretize the data. It retains all of CAIM’s advantages: fully automated generation of a possibly minimal number of discrete intervals, the highest class-attribute interdependency when compared with other discretization algorithms, and improved performance of machine learning algorithms that are subsequently used on the discretized data. We present results based on extensive benchmarking tests of F-CAIM, CAIM, and six other state-of-the-art discretization algorithms. The tests use eight well-known machine learning datasets consisting of continuous and mixed-mode attributes. They show that F-CAIM’s speed is comparable to that of the simplest unsupervised algorithms and better than that of other supervised discretization algorithms.
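To make the idea of class-attribute interdependence concrete, the following is a minimal sketch of how a candidate set of interval boundaries might be scored. It assumes the criterion as it is commonly stated for the original CAIM algorithm (for each interval, the squared count of the dominant class divided by the interval's total count, averaged over the intervals); the abstract itself does not give the formula. The function name `caim_score` and its parameters are hypothetical, and the sketch does not show the speed-up that F-CAIM introduces.

```python
from bisect import bisect_left
from collections import Counter


def caim_score(values, labels, boundaries):
    """Score one discretization scheme for a single continuous attribute.

    Assumed criterion (commonly stated for CAIM, not taken from the abstract):
        (1 / n) * sum over intervals r of (max_r ** 2) / M_r
    where n is the number of intervals, max_r is the count of the most
    frequent class in interval r, and M_r is the number of values in r.

    boundaries is a sorted list [d0, d1, ..., dn]; the intervals are
    [d0, d1], (d1, d2], ..., (d_{n-1}, dn].
    """
    n_intervals = len(boundaries) - 1
    inner = boundaries[1:-1]  # cut points strictly inside the value range
    counts = [Counter() for _ in range(n_intervals)]

    for value, label in zip(values, labels):
        # bisect_left puts a value equal to a cut point in the lower interval
        interval = bisect_left(inner, value)
        counts[interval][label] += 1

    total = 0.0
    for class_counts in counts:
        size = sum(class_counts.values())
        if size:  # skip empty intervals to avoid dividing by zero
            total += max(class_counts.values()) ** 2 / size
    return total / n_intervals


values = [1, 2, 3, 4, 5, 6]
labels = ["a", "a", "a", "b", "b", "b"]
print(caim_score(values, labels, [1, 3.5, 6]))  # cut separates the classes
print(caim_score(values, labels, [1, 2.5, 6]))  # cut mixes the classes
```

Under this assumed criterion, a cut that separates the two classes cleanly scores higher than one that leaves an interval with mixed classes, which is the behaviour a greedy search over candidate boundaries would exploit.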
Similar resources
Discretization Algorithm that Uses Class-Attribute Interdependence Maximization
Most of the existing machine learning algorithms are able to extract knowledge from databases that store discrete attributes (features). If the attributes are continuous, the algorithms can be integrated with a discretization algorithm that transforms them into discrete attributes. The paper describes an algorithm, called CAIM (class-attribute interdependence maximization), for discretization o...
A Novel Tree Based Classification
Classification is a data mining (DM) technique used to predict or forecast unknown information using historical data. There are many classification techniques. ID3 is a very popular tree-based classification algorithm for categorical data that does not support continuous data. The attribute selection process plays a major role in building a classification tree model. Attribute Selection in...
A Discretization Algorithm for Uncertain Data
This paper proposes a new discretization algorithm for uncertain data. Uncertainty is widely spread in real-world data. Numerous factors lead to data uncertainty including data acquisition device error, approximate measurement, sampling fault, transmission latency, data integration error and so on. In many cases, estimating and modeling the uncertainty for underlying data is available and many ...
The Interaction of Entropy-Based Discretization and Sample Size: An Empirical Study
An empirical investigation of the interaction of sample size and discretization – in this case the entropy-based method CAIM (Class-Attribute Interdependence Maximization) – was undertaken to evaluate the impact and potential bias introduced into data mining performance metrics due to variation in sample size as it impacts the discretization process. Of particular interest was the effect of dis...
Experiments with Decision Tree Classifiers – Discretization of Numerical Attributes
Classification algorithms are used in numerous applications every day, from assigning letter grades to students’ scores, to computerized letter recognition in mail processing. Discretization consists of applying a set of rules to reduce the number of discrete intervals from which an attribute is assigned. Discretization is generally applied to datasets whose numerical range consists of c...